Language Generation in the Limit

Neural Information Processing Systems

Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don't already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without further assumptions. In particular, suppose that an adversary enumerates the strings of an unknown target language L that is known only to come from one of a possibly infinite list of candidates. A computational agent is trying to learn to generate from this language; we say that the agent generates from L in the limit if after some finite point in the enumeration of L, the agent is able to produce new elements that come exclusively from L and that have not yet been presented by the adversary. Our main result is that there is an agent that is able to generate in the limit for every countable list of candidate languages. This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it.
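
Spelled out in symbols, the definition in this abstract reads as follows; the notation (L_1, L_2, ... for the candidate list, w_1, w_2, ... for the adversary's enumeration, g_t for the agent's outputs) is chosen here for illustration and is not taken from the paper.

% A formalization of "generation in the limit" as described in the abstract.
Fix a countable collection $\mathcal{C} = \{L_1, L_2, \dots\}$ and a target
$L \in \mathcal{C}$. The adversary presents an enumeration $w_1, w_2, \dots$
of $L$ in which every string of $L$ eventually appears; after seeing
$w_1, \dots, w_t$, the agent outputs a string $g_t$. The agent
\emph{generates from $L$ in the limit} if there is a finite $t^{\ast}$ with
\[
  g_t \in L \setminus \{w_1, \dots, w_t\} \quad \text{for all } t \ge t^{\ast}.
\]
The main theorem asserts that a single agent achieves this for every countable
$\mathcal{C}$, every $L \in \mathcal{C}$, and every enumeration.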


Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration

Karita, Shigeki, Koizumi, Yuma, Zen, Heiga, Ishikawa, Haruko, Scheibler, Robin, Bacchiani, Michiel

arXiv.org Artificial Intelligence

Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour-scale data cleaning, targeting training corpora for large-scale generative models such as large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while the USM parameters remained fixed. Experimental results demonstrate Miipher-2's superior or comparable performance relative to conventional SR models in word error rate, speaker similarity, and both objective and subjective sound-quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078 and enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
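
The three-day claim in the last sentence follows directly from the reported real-time factor; the short back-of-the-envelope script below (variable names ours) reproduces the figure.

# Sanity check of the reported throughput. The RTF and the accelerator
# count are taken from the abstract; the rest is arithmetic.
dataset_hours = 1_000_000   # million-hour speech dataset
rtf = 0.0078                # real-time factor per accelerator
accelerators = 100

compute_hours = dataset_hours * rtf            # 7,800 accelerator-hours
wall_clock_hours = compute_hours / accelerators
print(f"{wall_clock_hours:.0f} hours ~ {wall_clock_hours / 24:.1f} days")
# -> 78 hours ~ 3.2 days, matching the "approximately three days" claim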


Probing Large Language Models in Reasoning and Translating Complex Linguistic Puzzles

Lin, Zheng-Lin, Shih, Yu-Fei, Hsieh, Shu-Kai

arXiv.org Artificial Intelligence

This paper investigates the utilization of Large Language Models (LLMs) for solving complex linguistic puzzles, a domain requiring advanced reasoning and adept translation capabilities akin to human cognitive processes. We explore specific prompting techniques designed to enhance the ability of LLMs to reason and to elucidate their decision-making pathways, with a focus on Input-Output Prompting (IO), Chain-of-Thought Prompting (CoT), and Solo Performance Prompting (SPP). Utilizing datasets from the Puzzling Machine Competition and various Linguistics Olympiads, we employ a comprehensive set of metrics to assess the performance of GPT-4 0603, a prominent LLM, across these prompting methods. Our findings illuminate the potential of LLMs in linguistic reasoning and complex translation tasks, highlighting their capabilities and identifying limitations in the context of linguistic puzzles. This research contributes to the broader field of Natural Language Processing (NLP) by providing insights into the optimization of LLM applications for improved reasoning and translation accuracy, thereby enriching the ongoing dialogue in NLP advancements.
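
To make the three prompting schemes concrete, here is a minimal sketch of what each template might look like; the wording is illustrative and not taken from the paper's actual prompts or evaluation harness.

# Illustrative templates for the three prompting schemes compared in the
# paper. The exact phrasing used by the authors may differ.

def io_prompt(puzzle: str) -> str:
    # Input-Output (IO): ask directly for the answer, no intermediate steps.
    return f"Solve the following linguistic puzzle.\n\n{puzzle}\n\nAnswer:"

def cot_prompt(puzzle: str) -> str:
    # Chain-of-Thought (CoT): elicit step-by-step reasoning before the answer.
    return ("Solve the following linguistic puzzle. Think step by step, "
            "explaining each inference, then state the final answer.\n\n"
            f"{puzzle}")

def spp_prompt(puzzle: str) -> str:
    # Solo Performance Prompting (SPP): a single model simulates several
    # personas who discuss the puzzle and converge on an answer.
    return ("You will simulate three experts - a field linguist, a "
            "translator, and a logician - who discuss the puzzle below, "
            "critique each other's proposals, and then agree on a final "
            f"answer.\n\n{puzzle}")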


Three archaeological mysteries AI could soon solve - including cracking an unknown language on Bronze Age tablets

Daily Mail - Science & tech

The uncanny ability of artificial intelligence to spot patterns in large amounts of data could finally unravel some of the thorniest mysteries of the ancient world. Researchers working with companies such as IBM and Google's DeepMind are on the brink of deciphering ancient texts once thought unreadable - and even 'cracking' an unknown language from almost two millennia before the birth of Christ. AI allows researchers to sift through images far faster than human beings, and the techniques could answer fundamental questions about the history of language and potentially uncover lost works by Greek and Roman writers. A mysterious unknown language, 'Linear A', discovered on tablets in Crete in 1900, has never been deciphered - but AI might be able to crack the code. Among the world's most famous examples of unknown languages, the strange 'Linear A' script, found on stones and tablets, is considered the main script used by the Minoan civilization, a Bronze Age kingdom led by King Minos.


Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages

Gupta, Akshat, Li, Xinjian, Rallabandi, Sai Krishna, Black, Alan W

arXiv.org Artificial Intelligence

With recent advancements in language technologies, humans are now interacting with technology through speech. To increase the reach of these technologies, we need to build such systems in local languages. A major bottleneck here is the underlying data-intensive parts that make up such systems, including automatic speech recognition (ASR) systems that require large amounts of labelled data. With the aim of aiding the development of dialog systems in low-resourced languages, we propose a novel acoustics-based intent recognition system that uses discovered phonetic units for intent classification. The system is made up of two blocks - the first block generates a transcript of discovered phonetic units for the input audio, and the second block performs intent classification from the generated phonemic transcripts. Our work presents results for such a system for two language families - Indic languages and Romance languages - on two different intent recognition tasks. We also perform multilingual training of our intent classifier and show improved cross-lingual transfer and performance on an unknown language with zero resources in the same language family.
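
The two-block structure described above can be sketched as a simple pipeline. In the sketch below, acoustic_to_phonetic_units() is a hypothetical stand-in for the phonetic unit discovery block, and a bag-of-phoneme-n-grams classifier stands in for the intent classifier; this shows the shape of the system, not the paper's exact models.

# Minimal sketch of the two-block pipeline from the abstract: block 1 maps
# audio to a transcript of discovered phonetic units; block 2 classifies
# intent from that transcript. Both components here are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def acoustic_to_phonetic_units(audio) -> str:
    # Block 1 (hypothetical): a pretrained acoustic unit discovery model
    # would return a unit transcript such as "p_12 p_3 p_44 p_7 ...".
    raise NotImplementedError

# Block 2: intent classification over whitespace-separated phonetic units,
# using unigram-to-trigram counts of the discovered units as features.
intent_classifier = make_pipeline(
    CountVectorizer(analyzer="word", token_pattern=r"\S+", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)

# transcripts, intents = zip(*labelled_data)  # phonemic transcripts + labels
# intent_classifier.fit(transcripts, intents)
# intent_classifier.predict([acoustic_to_phonetic_units(new_audio)])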


How a Mysterious Manuscript Keeps Confounding AI

#artificialintelligence

Playbook for the Cult of Isis, herbal health instructions, details of the benefits of therapeutic bathing, or a written history of speaking in tongues. Confused? Probably, but not as much as most are when they come face to face with the Voynich Manuscript. That each of the above is a proposed theme for its indecipherable scribbles indicates the level of confusion. A brief meditation on the sentence "a written history of speaking in tongues" should also help you get in the confusion ballpark. Written in the early part of the 15th century, the manuscript, a 240-page compendium of seemingly illegible and likely codified text, has amassed a proud track record of confounding scholars and eminent code breakers alike, including Alan Turing.


Inside the Race to Build a Brain-Machine Interface--and Outpace Evolution

WIRED

In an ordinary hospital room in Los Angeles, a young woman named Lauren Dickerson waits for her chance to make history. She's 25 years old, a teacher's assistant in a middle school, with warm eyes and computer cables emerging like futuristic dreadlocks from the bandages wrapped around her head. Three days earlier, a neurosurgeon drilled 11 holes through her skull, slid 11 wires the size of spaghetti into her brain, and connected the wires to a bank of computers. Now she's caged in by bed rails, with plastic tubes snaking up her arm and medical monitors tracking her vital signs. She tries not to move.